{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "# Entity Service Similarity Scores Output\n", "\n", "This tutorial demonstrates generating CLKs from PII, creating a new project on the entity service, and retrieving the results.\n", "The output type is raw similarity scores. This output type is particularly useful for determining a good threshold for the greedy solver used in mapping.\n", "\n", "The sections are usually run by different participants, but for illustration everything is carried out in this one file. The participants providing data are *Alice* and *Bob*, and the analyst is acting as the integration authority.\n", "\n", "### Who learns what?\n", "\n", "Alice and Bob will both generate and upload their CLKs.\n", "\n", "The analyst, who creates the linkage project, learns the `similarity scores`. Be aware that this is a lot of information and is subject to frequency attacks.\n", "\n", "### Steps\n", "\n", "* Check connection to Entity Service\n", "* Data preparation\n", "  * Write CSV files with PII\n", "  * Create a Linkage Schema\n", "* Create Linkage Project\n", "* Generate CLKs from PII\n", "* Upload the CLKs\n", "* Create a run\n", "* Retrieve and analyse results" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import json\n", "import os\n", "import time\n", "\n", "import matplotlib.pyplot as plt\n", "import requests\n", "import clkhash.rest_client\n", "from IPython.display import clear_output" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "## Check Connection\n", "\n", "If you are connecting to a custom entity service, change the address here." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Testing anonlink-entity-service hosted at https://testing.es.data61.xyz\n" ] } ], "source": [ "url = os.getenv(\"SERVER\", \"https://testing.es.data61.xyz\")\n", "print(f'Testing anonlink-entity-service hosted at {url}')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"project_count\": 2115, \"rate\": 7737583, \"status\": \"ok\"}\r\n" ] } ], "source": [ "!clkutil status --server \"{url}\"" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "## Data preparation\n", "\n", "Following the [clkhash tutorial](http://clkhash.readthedocs.io/en/latest/tutorial_cli.html) we will use a dataset from the `recordlinkage` library. We will just write both datasets out to temporary CSV files.\n", "\n", "If you are following along yourself you may have to adjust the file names in all the `!clkutil` commands." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "from tempfile import NamedTemporaryFile\n", "from recordlinkage.datasets import load_febrl4" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "text/html": [ "

| rec_id | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id |
|---|---|---|---|---|---|---|---|---|---|---|
| rec-1070-org | michaela | neumann | 8 | stanley street | miami | winston hills | 4223 | nsw | 19151111 | 5304218 |
| rec-1016-org | courtney | painter | 12 | pinkerton circuit | bega flats | richlands | 4560 | vic | 19161214 | 4066625 |
| rec-4405-org | charles | green | 38 | salkauskas crescent | kela | dapto | 4566 | nsw | 19480930 | 4365168 |